Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

Authors

  • Jonathan Shen
  • Ruoming Pang
  • Ron J. Weiss
  • Mike Schuster
  • Navdeep Jaitly
  • Zongheng Yang
  • Zhifeng Chen
  • Yu Zhang
  • Yuxuan Wang
  • R. J. Skerry-Ryan
  • Rif A. Saurous
  • Yannis Agiomyrgiannakis
  • Yonghui Wu
Abstract

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
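
For readers unfamiliar with the mel scale that the intermediate spectrograms are expressed in, the following is a minimal sketch of one common Hz-to-mel mapping. It assumes the widely used formula mel = 2595 * log10(1 + f / 700); the abstract does not say which variant the authors use, and the function names and the band count and frequency range in the usage lines are illustrative choices, not values taken from the paper.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(num_bands: int, fmin_hz: float, fmax_hz: float) -> list:
    """Return band centre frequencies (Hz) spaced evenly on the mel scale.

    The centres are the interior points of num_bands + 2 equally spaced
    mel values between fmin_hz and fmax_hz, so neither endpoint is included.
    """
    lo = hz_to_mel(fmin_hz)
    hi = hz_to_mel(fmax_hz)
    step = (hi - lo) / (num_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_bands)]


if __name__ == "__main__":
    # Illustrative values only: 8 bands between 0 Hz and 8 kHz.
    for centre in mel_band_centers(8, 0.0, 8000.0):
        print(f"{centre:8.1f} Hz")
```

Because the bands are evenly spaced in mels, their centres sit close together at low frequencies and spread out at high frequencies, which is what makes a mel spectrogram a more compact representation than a linear-frequency one.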


Similar Resources

Text-to-speech Synthesis System based on Wavenet

In this project, we focus on building a novel parametric TTS system. Our model is based on WaveNet (Oord et al., 2016), a deep neural network introduced by DeepMind in late 2016 for generating raw audio waveforms. It is fully probabilistic, with the predictive distribution for each audio sample conditioned on all previous samples. The model introduces the idea of convolutional layer into TTS task...

Speaker-Dependent WaveNet Vocoder

In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as auxiliary features of WaveNet. It is expected that WaveNet can learn a sample-by-sample correspondence between speech waveform and acoustic features. The advantage of the proposed method is that it does not require (1) exp...

Speech Enhancement Using Bayesian Wavenet

In recent years, deep learning has achieved great success in speech enhancement. However, there are two major limitations regarding existing works. First, the Bayesian framework is not adopted in many such deep-learning-based algorithms. In particular, the prior distribution for speech in the Bayesian framework has been shown useful by regularizing the output to be in the speech space, and thus...

Non-filter waveform generation from cepstrum using spectral phase reconstruction

This paper discusses non-filter waveform generation from cepstral features using spectral phase reconstruction as an alternative method to replace the conventional source-filter model in text-to-speech (TTS) systems. As the primary purpose of the use of filters is considered as producing a waveform from the desired spectrum shape, one possible alternative of the source-filter framework is to dir...

Improved Speech Synthesis Using Fuzzy Methods

The paper presents theoretical support for and describes the use of a fuzzy paradigm in implementing a TTS system for the Romanian language, employing a rule-based formant synthesizer. In the framework of classic TTS systems, we propose a new approach in order to improve formant trace computation, aiming at increasing synthetic speech perceptual quality. A fuzzy system is proposed for solving t...

Journal:
  • CoRR

Volume: abs/1712.05884

Publication date: 2017